Hi, everyone. Welcome back to the Heterogeneous Parallel Programming class. We are at Lecture 1.8, Kernel-Based Parallel Programming, and we will be discussing basic matrix-matrix multiplication.

The objective for this lecture is to prepare you for the lab assignment on matrix-matrix multiplication. We will first review the computation for matrix multiplication. We will also discuss how we can do the block and thread index to data index mapping for this particular computation. And then we will introduce loop control flow in kernels.

Matrix multiplication is a fairly frequently used computation in engineering and science applications, and the computation is fairly simple. We have two input matrices, A and B, and we generate a product matrix C. Each C element is generated as the dot product of a row of A and a column of B. So in this particular picture, we show that a C element has the coordinates Row and Col. We use the row of A with index Row and the column of B with index Col, and we take that row and that column and do a dot product to generate the value of the C element.
In all the discussions that we have today, we will assume that the A matrix is of dimension m times n, the B matrix is of dimension n times k, and the C matrix that we generate is of dimension m times k. The n part, which is the number of elements in a row of A and the number of elements in a column of B, must match, because we are doing a dot product, so we must have the same number of elements from A and from B in that dot product. We are going to use these variables consistently in all the examples in this lecture.

So here we show a simple sequential C code for matrix multiplication. The C code has two levels of loops, and these two loops systematically go through all the C elements, in all the row positions and all the column positions. For every C element that it visits, it generates the dot product of the corresponding A row and B column for that C element. So here is the C code for generating each C element.
We first initialize the C element value to 0, and then we systematically visit all the elements in the row of A and the column of B. When we visit the row of A, we first need to reach the beginning location of that row. The beginning location can be found by multiplying the variable Row by the number of elements in each row of A, which is n. So Row times n moves us to the beginning location of the row in the linear, row-major layout of A. Then we do the same thing with B. In B's case, the beginning location of the column is actually in row 0, so we can simply use Col to move the index to the beginning element of that column in B. Now, for every iteration of the i loop, we need to be able to calculate the corresponding element within the row and the column. For A it is very simple: all the elements in the same row are in adjacent locations, so all we have to do is add i to the beginning index, and that gives us the proper element for that iteration. For B it is a little bit more complex, because in a row-major layout all the elements in a column are in strided locations.
So we need to multiply i by the number of elements in each row of B, which is k, and then add that to the beginning location. These two memory accesses give us the appropriate elements of A and B in each iteration, and then we do the multiplication and accumulate the result into sum. Once we finish the for loop, we have the total value for the C element. Now we can write the C element value into the appropriate C location, which is the location at Row and Col. So again, we just calculate the linearized address by multiplying Row by k, and then adding Col to the index. This gives us the complete sequential C code for matrix multiplication. Notice that this particular calculation is a very simple and very basic form. Since matrix multiplication is a very important computation, there are actually lots and lots of optimized ways of doing the sequential computation. We will come back to this point later on.

So here we will begin to design a kernel function in CUDA to perform matrix multiplication in parallel. We are going to use a very small example to give you the intuition behind the design of the kernel function.
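The sequential version walked through above can be sketched as follows. This is a minimal sketch following the lecture's conventions (row-major layout, dimensions m, n, k); the function name is illustrative:

```c
#include <stdio.h>

/* Sequential matrix multiplication: A is m x n, B is n x k, C is m x k.
   All three matrices use a linearized, row-major layout. */
void matrixMul(const float *A, const float *B, float *C, int m, int n, int k) {
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < k; ++col) {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i) {
                /* A[row][i] lives at row*n + i (adjacent within a row);
                   B[i][col] lives at i*k + col (strided within a column). */
                sum += A[row * n + i] * B[i * k + col];
            }
            C[row * k + col] = sum;  /* C[row][col] at row*k + col */
        }
    }
}
```

Note how the strided access `i * k + col` for B mirrors the lecture's point that column elements are not adjacent in a row-major layout.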
So here we are going to assume that we are using a two-dimensional, square block to organize the threads that calculate C. That is, each thread block is going to have an equal number of threads in both the X and Y dimensions. Obviously, this is a simplifying assumption; you can definitely use a rectangular block organization, but here we use a square organization just to keep the slide simple. And so, assuming that we have a square organization, we will assume that we have a compile-time defined constant, TILE_WIDTH, which gives us the number of threads in each dimension of the thread block.

We have a simple example here where we calculate a 4 by 3 C matrix. So in this case, m equals 4 and k equals 3. We will further assume that n is 4 for both A and B, which is not visible on this slide but will become useful later on. So for this very simple, small example, we have four elements in the Y dimension and three elements in the X dimension of C, and we need to make sure that we generate enough thread blocks to calculate all the C elements.
If we assume that TILE_WIDTH is 2 for this simple example, then we go through the same ceiling-function calculation that we used before in the picture kernel, and we conclude that we need two thread blocks in the Y dimension as well as two thread blocks in the X dimension to cover the calculation of all the C elements. So this gives us the simple example that we are going to be working with. In every thread block, we have two threads in the X dimension and two threads in the Y dimension, so these threads are going to be accessing the C elements located in the tile that the thread block is responsible for.

So now we are ready to discuss the host code for invoking the matrix multiplication kernel. We assume that the matrix dimension variables m, n, and k are available in the host code. We will be using the ceiling-function expressions based on k and m to generate enough thread blocks to cover all the C elements. This is very similar to what we used in the picture kernel.
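The ceiling-function grid sizing described here can be sketched as below. The helper is plain C so it can be checked on the host; the CUDA launch lines in the comment are a sketch under the lecture's conventions, and the kernel name is illustrative:

```c
#define TILE_WIDTH 2  /* threads per block in each dimension (example value) */

/* Ceiling division: the number of tiles needed to cover 'size' elements.
   For the lecture's example (C is 4 x 3, TILE_WIDTH = 2) this yields a
   2 x 2 grid of blocks. */
int ceil_div(int size, int tile) {
    return (size + tile - 1) / tile;
}

/* In the CUDA host code, these expressions would feed the launch
   configuration, roughly:

       dim3 dimGrid(ceil_div(k, TILE_WIDTH), ceil_div(m, TILE_WIDTH), 1);
       dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
       MatrixMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, m, n, k);
*/
```

The Z dimension is set to 1 in both the grid and the block, since it is not used for this two-dimensional computation.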
For each thread block, we declare that we have TILE_WIDTH times TILE_WIDTH threads, that is, TILE_WIDTH threads in each of the X and Y dimensions. Because the Z dimension is not used, we just use 1 for that dimension. So we launch the kernel with the grid dimension and block dimension that we set up in the host code, and then we pass m, n, k and the pointers to A, B, and C. Just another reminder: the A matrix is m times n, the B matrix is n times k, and the C matrix is m times k.

Here we show the matrix multiplication kernel. At the beginning, every thread determines the row and column index of the C element that it is calculating. Then they all proceed to test whether the row index and column index are within the valid range. If these indices are within the valid range, then the thread goes through the for loop to access the elements of the row of A and the elements of the column of B to produce the inner product. So this is pretty much the same as the sequential code. And after the for loop, we have the C element value.
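The per-thread logic described here can be sketched as a plain C function, so the index mapping can be exercised on the host; the parameter names stand in for CUDA's built-in variables, as shown in the comment:

```c
/* One thread's work in the matrix-multiplication kernel, written as host C
   so the mapping can be tested.  In the actual CUDA kernel, row and col come
   from the built-in index variables:

       int row = blockIdx.y * blockDim.y + threadIdx.y;
       int col = blockIdx.x * blockDim.x + threadIdx.x;
*/
void matrixMulThread(int blockIdxY, int blockIdxX,
                     int threadIdxY, int threadIdxX, int blockDim,
                     const float *A, const float *B, float *C,
                     int m, int n, int k) {
    int row = blockIdxY * blockDim + threadIdxY;
    int col = blockIdxX * blockDim + threadIdxX;
    if (row < m && col < k) {          /* boundary test: out-of-range threads do nothing */
        float sum = 0.0f;
        for (int i = 0; i < n; ++i)    /* inner product of A's row and B's column */
            sum += A[row * n + i] * B[i * k + col];
        C[row * k + col] = sum;        /* same linearized write as the sequential code */
    }
}
```

Looping this function over all blocks and threads on the host reproduces exactly what the grid of CUDA threads computes in parallel.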
So we write the C element value into the global memory, and the index calculation is exactly the same as in the sequential C code. This is the reason why we took the time to discuss the row-major layout and the linearization of indices. Keep in mind that in CUDA, A, B, and C are all going to be dynamically allocated with cudaMalloc; that is why we need to do the linearization of these indices when we read from A and B and write into C.

Now we are ready to review a little bit more of the dynamic execution behavior of the kernel. As we mentioned, there will be four thread blocks executing this kernel for our small 4 by 3 C example. Here we show the work that thread block (0,0) is going to be doing. For thread block (0,0), the block indices in the x and y dimensions are both 0. This means that when we calculate the row and column indices, these values are solely determined by the thread index in the x and y dimensions. So here we show that the row and column indices will be 0 and 1 across the four threads. Let's take a look at the work that thread (0,0) is going to do. Thread (0,0) has a row value of 0 and a column value of 0.
So it is going to be accessing the zeroth row of A and the zeroth column of B. We show a horizontal arrow across the zeroth row of A and a vertical arrow down the zeroth column of B to indicate all the memory accesses that thread (0,0) of block (0,0) is going to do. Similarly, thread (1,1) has a row value of 1 and a column value of 1, so it is going to be accessing row 1 of A and column 1 of B. We also show this with a horizontal arrow going through row 1 and a vertical arrow going through column 1 in the picture. So this picture shows the work that all four threads in block (0,0) will be doing when executing the kernel. In fact, you are encouraged to use very small numbers, such as the ones that we provided here, to test your kernel and make sure that you can look at all the numbers and verify their correctness before you start to use your kernel on large matrices.

For a little bit more insight, we will look at the work for block (0,1). In this case, the Y-dimension block index is still 0, but the X-dimension block index is 1.
So in this case the row indices remain the same, because the row indices are calculated based on the y block index, which is 0. However, the x dimension of the block index is now 1, so we have effectively shifted the column values of the threads to 2 and 3, compared to block (0,0). Now we see that all the threads in this thread block are going to be processing the upper-right corner of the C matrix. If we look at thread (0,0) in this block, it is still accessing row 0 of A, but it is accessing column 2 of B. It will generate that dot product and write it into C element (0,2). The more interesting comparison is for thread (1,0) and thread (1,1): those two threads are actually going to be processing column 3 of C, which is outside the valid range. So both of those threads are going to fail the condition test of the if statement, and these two threads will not be doing any work or any memory accesses when they execute the kernel.
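The boundary-test behavior for block (0,1) can be checked with a small host-side sketch of the index calculation (function name and parameter order are illustrative; the example dimensions match the lecture's 4 by 3 C matrix with TILE_WIDTH = 2):

```c
#define TILE_WIDTH 2

/* Returns 1 if the thread passes the kernel's boundary test, 0 if it is
   masked off.  Mirrors: row = blockIdx.y * TILE_WIDTH + threadIdx.y,
   col = blockIdx.x * TILE_WIDTH + threadIdx.x, then if (row < m && col < k). */
int threadInRange(int blockIdxY, int blockIdxX,
                  int threadIdxY, int threadIdxX, int m, int k) {
    int row = blockIdxY * TILE_WIDTH + threadIdxY;
    int col = blockIdxX * TILE_WIDTH + threadIdxX;
    return (row < m && col < k);
}
```

For block (0,1), the two threads with threadIdx.x = 1 compute col = 3, which fails the test for k = 3; the two threads with threadIdx.x = 0 compute col = 2 and proceed.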
So this gives us a good illustration of the runtime behavior, and I would like to encourage you to finish the analysis of this example for block (1,0) and block (1,1), okay?

So, at this point, we have finished the coverage of the simple matrix multiplication kernel in CUDA, and you are ready for the lab. What I would like to reiterate is that when you test your kernel, you should start with some very small matrices, such as the dimensions that we have shown in this example. You can even start with smaller, 2 by 2 matrices, so that you can use the kind of data values that you can calculate by hand and verify that your kernel is doing all the right things, before you apply the kernel to the much bigger matrices of the data that we supply for your testing.

So at this point we are at the end of week one. I would also like to encourage you to take the quiz as soon as possible, to make sure that you solidify your understanding of the week one material. For those of you who would like to learn more, I encourage you to read the textbook, section 4.3. Thank you.